Abstract In this paper, we propose the problem of online cost-sensitive
classifier adaptation and the first algorithm to solve it. We assume that we
have a base classifier for a cost-sensitive classification problem, but that it
was trained with respect to a cost setting different from the desired one.
Moreover, we also have training data samples streaming to the algorithm one by
one. The problem is to adapt the given base classifier to the desired cost
setting online, using the streaming training samples. To solve this problem, we
propose to learn a new classifier by adding an adaptation function to the base
classifier, and to update the adaptation function parameter according to the
streaming data samples. Given an input data sample and the cost of
misclassifying it, we update the adaptation function parameter by minimizing a
cost-weighted hinge loss while respecting the previously learned parameter. The
proposed algorithm is compared to both online and off-line cost-sensitive
algorithms on two cost-sensitive classification problems, and the experiments
show that it not only outperforms them in classification performance, but also
requires significantly less running time.
1 Introduction


In pattern recognition problems, we try to design a classification function to
predict the class label of a data sample, so that the misclassification errors
on a set of training samples are minimized. A popular assumption when learning
a classifier is that the loss of misclassifying any data sample in the training
set is equal. However, in real-world applications, different misclassifications
may result in significantly different costs. For example, in breast cancer
diagnosis, misclassifying a malignant tumor sample may incur a much higher cost
than misclassifying a benign tumor sample. Thus it is necessary to take the
costs of different types of misclassification into account when a classifier is
trained. This problem is called cost-sensitive learning in the machine learning
community. Given the cost setting, i.e., the costs of different
misclassifications, the target of cost-sensitive learning is to train a
classifier so that the overall misclassification cost is minimized. In
cost-sensitive binary classification, we can have different costs for
misclassifications of positive and negative samples. In this case,
misclassifying a positive sample as negative may result in a much higher cost
than misclassifying a negative sample as positive. So we must design a
classifier that correctly classifies most of the positive samples, while
allowing some misclassification of negative samples. In this way, the overall
misclassification cost can be minimized. Many cost-sensitive learning
algorithms have been proposed to take account of different misclassification
costs. For example, Zhou et al. [58] proposed to train cost-sensitive neural
networks by using sampling and threshold-moving (STM) techniques, so that the
distribution of the training data samples is modified and the costs of
different types of misclassification are conveyed by the appearances of the
examples. Sun et al. [32] provided a comprehensive analysis of the AdaBoost
algorithm regarding its application to the class imbalance problem, and
developed three cost-sensitive boosting algorithms (CSB) by introducing cost
items into the learning framework of AdaBoost. Masnadi-Shirazi and Vasconcelos
also proposed an AdaBoost-based cost-sensitive learning algorithm (ABC) for
designing cost-sensitive boosting algorithms, by considering two necessary
conditions for optimal cost-sensitive learning: the minimization of expected
losses by optimal cost-sensitive decision rules, and the minimization of
empirical loss to emphasize the neighborhood of the desired cost-sensitive
boundary. Ting [33] introduced a sample-weighting method (SW) to induce
cost-sensitive trees, by generalizing the standard tree induction process so
that the initial instance weights determine the type of tree to be induced:
minimum error trees or minimum high cost error trees. Chen et al. [10] proposed
a supervised learning algorithm, fast flux discriminant, for large-scale
nonlinear cost-sensitive classification problems, by decomposing the kernel
density estimation in the original feature space into selected low-dimensional
subspaces. This method achieves efficiency, interpretability and accuracy
simultaneously, and it is also sparse and naturally handles mixed data types.
With the rapid development of internet technology, more and more data is
generated continuously, and training sets grow every day as new data samples
are added. Moreover, the cost setting can also change from time to time. This
poses two new challenges to cost-sensitive learning:

1. When the cost setting is changed, the learned classifier cannot be adapted
to the new cost setting. A possible strategy to solve this problem is to learn
a new classifier according to the new cost setting from scratch, using the
entire training set and ignoring the classifier previously learned with the
previous cost setting. However, this strategy is time-consuming, especially
when the training set is large.

When we already have a classifier learned according to one cost setting, can
we utilize it to learn another classifier with regard to a different cost
setting? This problem is defined as classifier adaptation. Classifier
adaptation has already been applied to performance measure optimization and
cross-domain learning [57]. In the former, a classifier is learned to optimize
one performance measure and then adapted to optimize another, while in the
latter [47], a classifier is learned from one domain and then adapted to a
different domain. In this paper, we propose the problem of adapting a learned
classifier to a different cost setting.

2. When the data samples are generated and added to the training set one by
one, traditional cost-sensitive learning methods cannot be applied, since they
assume that the entire training set is given to the algorithm at once.
Recently, the cost-sensitive online classification (CSOC) method was proposed
by Wang et al. [51]. This method takes the training samples one by one and
updates the cost-sensitive classifier online. However, CSOC is also constrained
to a fixed cost setting. When a cost setting is given, it learns a new
classifier online, and ignores other classifiers learned with different cost
settings. Can we learn a classifier online from a base classifier trained with
a different cost setting? This remains an open problem.

To solve the above two problems simultaneously, in this paper we propose the
first online cost-sensitive classifier adaptation method. We assume that we
have an existing cost-sensitive classifier, and we try to adapt it to a
different cost setting with the help of data samples arriving one by one in an
online way. The adaptation is implemented by adding an adaptation function to
the existing classifier, and it is learned by updating the adaptation function
parameter with the incoming training samples, which have different
misclassification costs. We construct an objective function that both respects
the previously learned parameter and minimizes the cost-weighted hinge loss on
the incoming training sample. By solving the objective function with a gradient
descent method, we develop an iterative algorithm. The contributions of this
paper are twofold:

1. We propose the problem of online cost-sensitive classifier adaptation.
2. We propose a novel algorithm to solve this problem.

The rest of this paper is organized as follows: in Section 2 we introduce
the proposed method. In Section 3 the proposed method is evaluated
on some benchmark data sets. In Section 4 the paper is concluded with some
future work.


2 Proposed Method 

2.1 Problem Formulation 


In this paper, instead of learning a new cost-sensitive classifier from scratch
from the given training set and cost setting, we use an existing classifier and
employ the framework of classifier adaptation to learn the cost-sensitive
classifier efficiently. Suppose that we already have a classifier
$f_0(\mathbf{x})$, learned either without considering the different costs of
misclassifying positive and negative samples or with a different cost setting,
and that we want to adapt it to a problem with a new cost setting. To this end,
we construct a new classifier $f(\mathbf{x})$ by adding a linear adaptation
function $\mathbf{w}^\top \mathbf{x}$ to $f_0(\mathbf{x})$, i.e.

$$f(\mathbf{x}) = f_0(\mathbf{x}) + \mathbf{w}^\top \mathbf{x} \tag{1}$$

where $\mathbf{w} \in \mathbb{R}^d$ is the adaptation function parameter. Please
note that $f_0(\mathbf{x})$ can be any type of classifier, for example, SVM,
AdaBoost, etc. In this way, we reduce the problem of cost-sensitive classifier
adaptation to the learning of $\mathbf{w}$.
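
As an illustration of (1), the following minimal Python sketch wraps a base
scorer with a linear adaptation term. It assumes NumPy; the helper name
make_adapted_classifier and the fixed linear scorer standing in for the base
classifier are hypothetical and are not prescribed by the method.

```python
import numpy as np

def make_adapted_classifier(f0, w):
    """Return f(x) = f0(x) + w^T x for a fixed base scorer f0 and parameter w."""
    w = np.asarray(w, dtype=float)

    def f(x):
        x = np.asarray(x, dtype=float)
        return float(f0(x)) + float(w @ x)

    return f

# Hypothetical base classifier: a fixed linear scorer standing in for, e.g., an SVM.
base_w = np.array([1.0, -1.0])

def f0(x):
    return float(base_w @ np.asarray(x, dtype=float))

# With w = 0 the adapted classifier reproduces the base classifier.
f_zero = make_adapted_classifier(f0, np.zeros(2))
# A non-zero w shifts the score by w^T x.
f_adapted = make_adapted_classifier(f0, np.array([0.5, 0.5]))

x = np.array([2.0, 1.0])
print(f0(x), f_zero(x), f_adapted(x))
```

Because the base scorer is only called, never modified, any classifier that
returns a real-valued score can be plugged in for f0.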

In the traditional cost-sensitive learning problem, a training data set
composed of many positive and negative training samples is given. The cost
factors of misclassifying positive and negative samples are denoted as $C_+$
and $C_-$ respectively. Please note that when $f_0(\mathbf{x})$ was trained,
$C_+$ and $C_-$ were set to values different from the desired ones. The target
of cost-sensitive learning is to learn a classifier which minimizes the overall
misclassification cost on the training samples. However, in the online learning
scenario, we do not have the entire training data set during the training
procedure. Instead, the training data samples are given sequentially, and the
algorithm runs iteratively. In each iteration, only one training sample is
given, and the classifier is updated only with regard to this training sample.
In the $t$-th iteration, we assume that we have a training sample
$(\mathbf{x}_t, y_t)$, where $\mathbf{x}_t \in \mathbb{R}^d$ is its
$d$-dimensional feature vector, and $y_t \in \{+1, -1\}$ is its corresponding
class label. The corresponding misclassification cost is also given as $C_t$,

$$C_t = \begin{cases} C_+, & \text{if } y_t = +1; \\ C_-, & \text{if } y_t = -1. \end{cases} \tag{2}$$


Moreover, we also assume that we have already learned an adaptation function
parameter $\mathbf{w}_{t-1}$ in the previous iteration. To update $\mathbf{w}$,
we consider the following two problems.

- Respecting the previously learned $\mathbf{w}_{t-1}$: To keep the learned
$\mathbf{w}$ consistent across iterations, we want the updated $\mathbf{w}_t$
to stay close to the previous $\mathbf{w}_{t-1}$. To this end, we minimize the
squared $\ell_2$ distance between them,

$$\min_{\mathbf{w}} \frac{1}{2} \|\mathbf{w} - \mathbf{w}_{t-1}\|_2^2. \tag{3}$$

- Minimizing the cost-weighted hinge loss: To measure the loss of
misclassification, we apply the hinge loss function to $(\mathbf{x}_t, y_t)$,
which is defined as

$$\begin{aligned} \ell(y_t, f(\mathbf{x}_t)) &= \max\left(0, 1 - y_t f(\mathbf{x}_t)\right) \\ &= \max\left(0, 1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}^\top \mathbf{x}_t\right)\right). \end{aligned} \tag{4}$$

Since positive and negative samples have different misclassification costs,
we weight the hinge loss of the $t$-th sample by its corresponding cost factor
$C_t$, and minimize the weighted loss,

$$\min_{\mathbf{w}} C_t \times \max\left(0, 1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}^\top \mathbf{x}_t\right)\right). \tag{5}$$

By introducing a nonnegative slack variable $\xi$, the optimization problem
with a cost-weighted hinge loss is transformed to

$$\min_{\mathbf{w}, \xi} C_t \xi \quad \text{s.t. } 1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}^\top \mathbf{x}_t\right) \le \xi, \ 0 \le \xi. \tag{6}$$

Here the cost factor $C_t$ is similar to the penalty factor of SVM. However,
we must note that this penalty factor is cost-sensitive.

By considering the problems in (3) and (6) simultaneously, we obtain the
optimization problem for updating $\mathbf{w}$ in the $t$-th iteration,

$$\mathbf{w}_t = \arg\min_{\mathbf{w}, \xi} \frac{1}{2} \|\mathbf{w} - \mathbf{w}_{t-1}\|_2^2 + \alpha C_t \xi, \quad \text{s.t. } 1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}^\top \mathbf{x}_t\right) \le \xi, \ 0 \le \xi, \tag{7}$$

where $\alpha$ is a tradeoff parameter chosen by cross validation on a training
set. By solving this problem, we obtain an adaptation function parameter
$\mathbf{w}_t$ with regard to the training sample input in the $t$-th
iteration. Note that $\mathbf{w}_t$ is learned from both the previous
$\mathbf{w}_{t-1}$ and the training sample $(\mathbf{x}_t, y_t)$. Most
importantly, the update of $\mathbf{w}_t$ depends on the misclassification cost
of $(\mathbf{x}_t, y_t)$, since the cost factor acts as a loss weight. When
different samples arrive, different cost factors are used and the hinge loss is
weighted correspondingly.
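
The per-sample objective in (7) can be sketched in Python as follows. At the
optimum the slack variable equals the hinge loss, so the sketch evaluates the
equivalent unconstrained form. It assumes NumPy; the function names and the
numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def cost_for_label(y, c_pos, c_neg):
    """Cost factor C_t as in (2): C_+ for positive samples, C_- for negative ones."""
    return c_pos if y == 1 else c_neg

def weighted_hinge(w, x, y, f0_x, cost):
    """Cost-weighted hinge loss as in (5); f0_x is the base score f0(x)."""
    margin = y * (f0_x + w @ x)
    return cost * max(0.0, 1.0 - margin)

def step_objective(w, w_prev, x, y, f0_x, cost, alpha):
    """Objective (7) with the slack variable replaced by the hinge loss."""
    proximity = 0.5 * float(np.sum((w - w_prev) ** 2))
    return proximity + alpha * weighted_hinge(w, x, y, f0_x, cost)

# Illustrative values only (not taken from the experiments).
w_prev = np.zeros(2)
x = np.array([1.0, 0.0])
c_pos, c_neg, alpha = 5.0, 1.0, 0.1

# The same base score is penalized more heavily when the sample is positive.
obj_pos = step_objective(w_prev, w_prev, x, +1, -0.5, cost_for_label(+1, c_pos, c_neg), alpha)
obj_neg = step_objective(w_prev, w_prev, x, -1, -0.5, cost_for_label(-1, c_pos, c_neg), alpha)
print(obj_pos, obj_neg)
```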
2.2 Optimization

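To optimize the objective function in (7), we use the Lagrange multiplier
method. The Lagrange function is

$$\mathcal{L}(\mathbf{w}, \xi, \tau, \lambda) = \frac{1}{2} \|\mathbf{w} - \mathbf{w}_{t-1}\|_2^2 + \alpha C_t \xi + \tau \left(1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}^\top \mathbf{x}_t\right) - \xi\right) - \lambda \xi, \tag{8}$$

where $\tau$ is the nonnegative Lagrange multiplier for the constraint
$1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}^\top \mathbf{x}_t\right) \le \xi$,
and $\lambda$ is the nonnegative Lagrange multiplier for the constraint
$0 \le \xi$. According to the duality theory of optimization, the minimization
of (7) can be achieved by solving the following dual problem,

$$\max_{\tau, \lambda} \min_{\mathbf{w}, \xi} \mathcal{L}(\mathbf{w}, \xi, \tau, \lambda) \quad \text{s.t. } \tau \ge 0, \ \lambda \ge 0. \tag{9}$$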

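To solve this problem, we set the derivative of the Lagrange function
$\mathcal{L}(\mathbf{w}, \xi, \tau, \lambda)$ with regard to $\mathbf{w}$ to
zero, and we have

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = (\mathbf{w} - \mathbf{w}_{t-1}) - \tau y_t \mathbf{x}_t = 0 \ \Rightarrow \ \mathbf{w} = \mathbf{w}_{t-1} + \tau y_t \mathbf{x}_t. \tag{10}$$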

Remark: The motivation for setting the derivative of the Lagrange function
with regard to $\mathbf{w}$ to zero is to solve the inner minimization problem.
In equation (9), two optimization problems are coupled: an inner minimization
problem and an outer maximization problem. To solve this coupled problem, our
strategy is to first solve the inner problem, and then substitute the result
into the objective to solve the outer problem. Since $\mathcal{L}$ is convex
and differentiable in $\mathbf{w}$, its minimum with regard to $\mathbf{w}$ is
reached where its derivative is zero, so we set the derivative of $\mathcal{L}$
with regard to $\mathbf{w}$ to zero to obtain the optimal $\mathbf{w}$.

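Moreover, we also set its derivative with regard to $\xi$ to zero, and obtain

$$\frac{\partial \mathcal{L}}{\partial \xi} = \alpha C_t - \tau - \lambda = 0 \ \Rightarrow \ \alpha C_t - \tau = \lambda \ge 0 \ \Rightarrow \ \tau \le \alpha C_t. \tag{11}$$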

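Substituting the results of both (10) and (11) into the Lagrange function in
(8), we can rewrite it as a function of the single variable $\tau$,

$$\begin{aligned} \mathcal{L}(\tau) &= \frac{1}{2} \|\tau y_t \mathbf{x}_t\|_2^2 + \tau \left[1 - y_t \left(f_0(\mathbf{x}_t) + (\mathbf{w}_{t-1} + \tau y_t \mathbf{x}_t)^\top \mathbf{x}_t\right)\right] \\ &= \frac{1}{2} \tau^2 \mathbf{x}_t^\top \mathbf{x}_t + \tau \left[1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}_{t-1}^\top \mathbf{x}_t\right)\right] - \tau^2 \mathbf{x}_t^\top \mathbf{x}_t \\ &= -\frac{1}{2} \tau^2 \mathbf{x}_t^\top \mathbf{x}_t + \tau \left[1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}_{t-1}^\top \mathbf{x}_t\right)\right]. \end{aligned} \tag{12}$$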

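By setting the derivative of $\mathcal{L}(\tau)$ with regard to $\tau$ to zero,
we have the initial solution of $\tau$,

$$\frac{d \mathcal{L}(\tau)}{d \tau} = -\tau \mathbf{x}_t^\top \mathbf{x}_t + \left[1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}_{t-1}^\top \mathbf{x}_t\right)\right] = 0 \ \Rightarrow \ \tau = \frac{1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}_{t-1}^\top \mathbf{x}_t\right)}{\mathbf{x}_t^\top \mathbf{x}_t}. \tag{13}$$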

Moreover, we should also note that in (9) we have a constraint $\tau \ge 0$,
and in (11) we have another constraint $\tau \le \alpha C_t$. Thus the solution
of $\tau$ must fall in the following range,

$$0 \le \tau \le \alpha C_t. \tag{14}$$

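In this way, the solution $\tau_t$ can be obtained by discussing the following
three cases:

1. Case I: When $\frac{1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}_{t-1}^\top \mathbf{x}_t\right)}{\mathbf{x}_t^\top \mathbf{x}_t} < 0$, the solution is

$$\tau_t = 0, \tag{15}$$

so that the constraint $\tau \ge 0$ is satisfied.

2. Case II: When $0 \le \frac{1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}_{t-1}^\top \mathbf{x}_t\right)}{\mathbf{x}_t^\top \mathbf{x}_t} \le \alpha C_t$, the solution is

$$\tau_t = \frac{1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}_{t-1}^\top \mathbf{x}_t\right)}{\mathbf{x}_t^\top \mathbf{x}_t}, \tag{16}$$

so that the optimum of the dual objective in (12) is attained.

3. Case III: When $\alpha C_t < \frac{1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}_{t-1}^\top \mathbf{x}_t\right)}{\mathbf{x}_t^\top \mathbf{x}_t}$, we have the solution

$$\tau_t = \alpha C_t, \tag{17}$$

so that the constraint $\tau_t \le \alpha C_t$ is satisfied.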

After $\tau_t$ is determined, we can then update $\mathbf{w}_t$ using the
result in (10) as follows,

$$\mathbf{w}_t = \mathbf{w}_{t-1} + \tau_t y_t \mathbf{x}_t. \tag{18}$$

Note that the new adaptation function parameter is obtained by adding a bias
term $y_t \mathbf{x}_t$ determined by the $t$-th sample to the previous
$\mathbf{w}_{t-1}$. The bias term is weighted by the Lagrange multiplier
$\tau_t$, which is in turn bounded by the cost factor of the $t$-th sample.
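
The three cases (15)-(17) amount to clipping the unconstrained solution (13)
to the range (14), which can be sketched in Python as below. The sketch assumes
NumPy and a non-zero feature vector; the function name clipped_update and the
numerical values are hypothetical and serve only to exercise each case once.

```python
import numpy as np

def clipped_update(w_prev, x, y, f0_x, cost, alpha):
    """One update of w following (13)-(18).

    f0_x is the base classifier score f0(x); x must be a non-zero vector.
    Returns the new parameter and the multiplier tau that was used.
    """
    residual = 1.0 - y * (f0_x + w_prev @ x)   # numerator of (13)
    tau = residual / (x @ x)                   # unconstrained solution (13)
    tau = min(max(tau, 0.0), alpha * cost)     # enforce 0 <= tau <= alpha * C_t (14)
    return w_prev + tau * y * x, tau           # update (18)

# Illustrative values only (not taken from the experiments).
x = np.array([1.0, 0.0])
w0 = np.zeros(2)
alpha, cost = 0.2, 5.0                         # upper bound alpha * cost = 1.0

# Case I: the margin is already at least 1, so nothing changes.
_, tau_case1 = clipped_update(w0, x, +1, 2.0, cost, alpha)
# Case II: the unconstrained solution lies inside the allowed range.
w_case2, tau_case2 = clipped_update(w0, x, +1, 0.5, cost, alpha)
# Case III: a large violation is capped at alpha * cost.
_, tau_case3 = clipped_update(w0, x, +1, -2.0, cost, alpha)

print(tau_case1, tau_case2, tau_case3)
```

In Case II the updated classifier places the sample exactly on the margin,
while in Case III the cap limits how far a single sample can move the
parameter; a larger cost factor raises that cap.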


2.3 Algorithm 

Based on the optimization results, we can develop an online cost-sensitive
classifier adaptation algorithm which takes training samples one by one.
The algorithm takes an initial classifier $f_0(\mathbf{x})$ as an input, and
operates in an iterative way. In each iteration, one new training sample is
input to update the classifier adaptation function parameter, based on the
updating rules in (15)-(17) and (18).
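Algorithm 1 Online Cost-Sensitive Classifier Adaptation algorithm (OCSCA).

Input: An initial classifier function $f_0(\mathbf{x})$;

Input: Tradeoff parameter $\alpha$;

Initialize $t = 1$ and $\mathbf{w}_0 = \mathbf{0}$;

while a new training sample $(\mathbf{x}_t, y_t)$ with its corresponding
misclassification cost $C_t$ is input do

    Compute the initial solution of the Lagrange multiplier $\tau$ as

    $$\tau'_t = \frac{1 - y_t \left(f_0(\mathbf{x}_t) + \mathbf{w}_{t-1}^\top \mathbf{x}_t\right)}{\mathbf{x}_t^\top \mathbf{x}_t} \tag{19}$$

    Update $\tau_t$ as

    $$\tau_t = \begin{cases} 0, & \text{if } \tau'_t < 0, \\ \tau'_t, & \text{if } 0 \le \tau'_t \le \alpha C_t, \\ \alpha C_t, & \text{if } \tau'_t > \alpha C_t. \end{cases} \tag{20}$$

    Update $\mathbf{w}_t$ as

    $$\mathbf{w}_t = \mathbf{w}_{t-1} + \tau_t y_t \mathbf{x}_t \tag{21}$$

    Update $t = t + 1$;

end while

Output: The learned cost-sensitive classifier function $f(\mathbf{x}) = f_0(\mathbf{x}) + \mathbf{w}_{t-1}^\top \mathbf{x}$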
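
A compact Python sketch of Algorithm 1 is given below. It is not the
implementation used in the experiments: it assumes NumPy, treats the stream as
an iterable of (x, y) pairs, skips all-zero feature vectors (for which the
update is undefined), and breaks ties at a score of zero in favor of the
positive class. The names ocsca_fit and predict, the constant base scorer, and
the toy stream are hypothetical.

```python
import numpy as np

def ocsca_fit(stream, f0, dim, c_pos, c_neg, alpha):
    """Run the online adaptation loop and return the learned parameter w."""
    w = np.zeros(dim)                              # w_0 = 0
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        sq_norm = float(x @ x)
        if sq_norm == 0.0:
            continue                               # assumption: skip zero vectors
        cost = c_pos if y == 1 else c_neg          # cost factor C_t, as in (2)
        tau = (1.0 - y * (f0(x) + w @ x)) / sq_norm   # initial solution (19)
        tau = min(max(tau, 0.0), alpha * cost)     # clipping (20)
        w = w + tau * y * x                        # update (21)
    return w

def predict(x, f0, w):
    """Label from the sign of f(x) = f0(x) + w^T x (ties go to +1 by assumption)."""
    x = np.asarray(x, dtype=float)
    return 1 if f0(x) + w @ x >= 0 else -1

# Hypothetical uninformative base scorer and a two-sample toy stream.
def f0(x):
    return 0.0

stream = [([1.0, 0.0], +1), ([0.0, 1.0], -1)]
w = ocsca_fit(stream, f0, dim=2, c_pos=5.0, c_neg=1.0, alpha=0.5)
print(w, predict([1.0, 0.0], f0, w), predict([0.0, 1.0], f0, w))
```

In this toy run the positive sample receives a full step, while the step for
the negative sample is capped by its smaller cost factor, which is how the cost
setting enters the update.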


3 Experiments 

In this section, we study the proposed algorithm experimentally.


3.1 Data sets 

In the experiments, we used two cost-sensitive learning data sets, which are 
introduced as follows. 

3.1.1 Face detection data set 

The first data set is a face detection data set. This data set is
a large data set, and it contains 9832 face images and 9832 non-face images.
Each face image is treated as a positive sample, while each non-face image is
treated as a negative sample. Moreover, each image is represented as a
50,000-dimensional visual feature vector. The problem of face detection is to
classify a given candidate image as face or non-face. We set the cost of
misclassifying a face as non-face to 5, and that of misclassifying a non-face
as face to 1.

3.1.2 Car detection data set 

The second data set we used is a car detection data set [1]. This data set
contains 500 car images and 500 non-car images. The problem of car detection
is to classify a given candidate image as car or non-car. In this problem, car
images are defined as positive images, and non-car images are defined as
negative images. In this case, we set the cost of misclassifying a car image
to 8, and that of misclassifying a non-car image to 1.


3.2 Experiment setup 

To conduct the experiments, we used 10-fold cross validation. Each data set
was split into 10 folds randomly, and each fold was used in turn as the test
set, while the remaining 9 folds were combined as the training set. Moreover,
since the proposed method is based on the adaptation of a classifier
$f_0(\mathbf{x})$ trained with a different cost setting, we further split the
training set into two subsets. The first subset contains 2 folds, and we used
it to train $f_0(\mathbf{x})$ with a different cost setting. For the first data
set, we used the cost setting $C_+ = 2$ and $C_- = 1$ to train
$f_0(\mathbf{x})$, and for the second data set, we used $C_+ = 3$ and
$C_- = 1$. The second subset contains 7 folds, and we used it to learn
$\mathbf{w}$ with the proposed online learning algorithm, by feeding its
training samples to the algorithm one by one.
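
The splitting protocol can be sketched in Python as follows. The text does not
state which 2 of the 9 training folds are used for the base classifier, so the
sketch simply takes the first two; this choice, the function name
split_for_adaptation, the random seed and the sample count of 1000 are
assumptions made for illustration. Only NumPy is required.

```python
import numpy as np

def split_for_adaptation(n_samples, n_folds=10, n_base_folds=2, seed=0):
    """Yield (test, base, online) index arrays for each cross-validation round.

    base   : folds used to train the base classifier f0
    online : folds streamed one sample at a time to learn w
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        train_folds = [folds[j] for j in range(n_folds) if j != k]
        base_idx = np.concatenate(train_folds[:n_base_folds])
        online_idx = np.concatenate(train_folds[n_base_folds:])
        yield folds[k], base_idx, online_idx

# Example with 1000 samples: 100 test, 200 base and 700 online samples per round.
splits = list(split_for_adaptation(1000))
test_idx, base_idx, online_idx = splits[0]
print(len(test_idx), len(base_idx), len(online_idx))
```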

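The classification performances were measured by the average classification
accuracies and the average misclassification costs. They are defined as follows,

$$\text{Average classification accuracy} = \frac{\sum_{i: \mathbf{x}_i \in T} I(y_i = y_i^*)}{|T|}, \qquad \text{Average misclassification cost} = \frac{\sum_{i: \mathbf{x}_i \in T} C_i \, I(y_i \neq y_i^*)}{|T|}, \tag{22}$$

where $T$ is the test set, $y_i^*$ is the predicted class label of the $i$-th
test sample, $C_i$ is the misclassification cost of its true class, and
$I(y_i = y_i^*) = 1$ if $y_i = y_i^*$, and 0 otherwise.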
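
The two measures can be computed as in the following Python sketch. It assumes
that the average misclassification cost is the cost-weighted number of errors
divided by the test set size; this normalization, the function names and the
toy labels are assumptions made for illustration. Only NumPy is required.

```python
import numpy as np

def average_accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted label equals the true label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def average_misclassification_cost(y_true, y_pred, c_pos, c_neg):
    """Mean cost per test sample; each error is charged the cost of its true class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    costs = np.where(y_true == 1, c_pos, c_neg)
    return float(np.mean(costs * (y_true != y_pred)))

# Toy labels: one missed positive (cost 5) and one false alarm (cost 1).
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, 1]
acc = average_accuracy(y_true, y_pred)
avg_cost = average_misclassification_cost(y_true, y_pred, c_pos=5.0, c_neg=1.0)
print(acc, avg_cost)
```

Both errors count equally toward the accuracy, but the missed positive
dominates the average cost, which is why the two measures are reported side by
side.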


3.3 Results 

We first compared the proposed online cost-sensitive learning algorithm based
on classifier adaptation to an online cost-sensitive learning algorithm that
does not consider the existing classifier $f_0(\mathbf{x})$, and then compared
it to some traditional cost-sensitive learning algorithms. Because the proposed
algorithm is the only method that can take advantage of $f_0(\mathbf{x})$, for
a fair comparison, when we ran the other algorithms, both $f_0(\mathbf{x})$ and
the 2 folds of the training set used to train $f_0(\mathbf{x})$ were ignored.

3.3.1 Comparison to online cost-sensitive classification method

The boxplots of the classification accuracies and misclassification costs of
the proposed online cost-sensitive classifier adaptation algorithm (OCSCA) and
the CSOC algorithm over 10-fold cross validation are given in Fig. 1. From this
figure, we can see that the proposed method outperforms the CSOC algorithm
on both average accuracy and misclassification cost. Especially in the case of
misclassification cost, the proposed algorithm achieves a clearly lower average
misclassification cost than CSOC. This is because the proposed method takes
advantage of an existing predictor learned from more data points by adapting it
to a given cost setting, even though the existing predictor was learned
according to a different cost setting. This is strong evidence that classifier
adaptation can benefit cost-sensitive learning.
Fig. 1 Classification accuracies and misclassification costs of two online
cost-sensitive learning algorithms.


3.3.2 Comparison to off-line cost-sensitive classification methods

We also compared the proposed OCSCA to four of the most popular off-line
cost-sensitive learning algorithms: STM proposed by Zhou et al. [58], CSB
proposed by Sun et al. [32], ABC proposed by Masnadi-Shirazi and Vasconcelos,
and SW proposed by Ting [33]. The boxplots of the classification accuracies and
misclassification costs are given in Fig. 2. It is clear that in both plots,
the proposed algorithm outperforms the compared algorithms on both
classification accuracy and misclassification cost. The improvement is even
more pronounced on the misclassification costs. A main reason for this
phenomenon is that the proposed OCSCA algorithm starts learning from a base
classifier $f_0(\mathbf{x})$, and then adapts it to the given cost setting via
a training set, while the other algorithms ignore $f_0(\mathbf{x})$ and
directly learn the classifier from the training set. This means that using a
base classifier and adapting it to a training set can significantly boost the
performance of cost-sensitive learning. Moreover, among the compared
algorithms, ABC and CSB seem to perform slightly better than the other two. A
possible reason is that they use the formulation of the AdaBoost algorithm,
which performs well on detection problems.
Fig. 2 Classification accuracies and misclassification costs of the proposed algorithm and
off-line cost-sensitive learning algorithms.
3.3.3 Running time

An important advantage of the proposed OCSCA algorithm is its low time
complexity compared to off-line algorithms. Thus we also compared the running
time of these methods, and the results are given in Fig. 3. It is obvious that
the running time of the two online learning algorithms, OCSCA and CSOC, is
much less than that of the off-line learning algorithms. Both OCSCA and CSOC
take less than 200 seconds, while all the off-line learning algorithms take
more than 800 seconds. This is not surprising because in each iteration,
OCSCA and CSOC update the classifier using only one data sample, while
the off-line learning algorithms need to consider all the training samples.


[Bar chart of running time for OCSCA, CSOC, ABC, CSB, STM and SW]

Fig. 3 Running time of the learning procedure of online learning algorithms and
off-line algorithms.


4 Conclusions and future works 

In this paper, we proposed the problem of adapting an existing base classifier
to a cost-sensitive classification problem, where the base classifier was
trained using a different cost setting. Moreover, we proposed a novel online
learning algorithm for the adaptation of the classifier. The algorithm takes
one data sample at a time to update the adaptation parameter. The advantages of
this method are twofold:

1. It can use the base classifier to boost the classification performance, and

2. its running time is low due to its online learning nature.

In this work, we used the SVM formulation of learning. In the future,
we will study other classification methods, such as AdaBoost. We will design
an iterative algorithm to learn the classifier online by adapting an existing
classifier trained with a different cost setting, where the adaptation function
is a combination of some candidate weak classifiers. In each iteration, we will
select a weak classifier according to the classification cost of the incoming
data point, and update its weight. Moreover, the loss function of AdaBoost will
be modified to consider the classification costs. We will also investigate
the application of the proposed algorithm to information security,
bioinformatics [24, 57], medical imaging [54, 55, 56], computer vision,
reinforcement learning, cloud computing and microprocessor reliability
modeling.